Terminal Bench 2.0

A Harbor-native benchmark of 89 expert-crafted tasks measuring how well AI agents master real terminal environments across SWE, ML, security, and data science

Published: September 14, 2025

Keywords: Terminal Bench, terminal-bench 2.0, AI agent benchmark, terminal mastery, Harbor framework, Stanford, Laude, software engineering, machine learning, security, data science, coding agent evaluation, LLM agent leaderboard

Introduction

Most AI benchmarks test what models know. Terminal Bench tests what agents can do — inside a real terminal, with real tools, on real tasks.

Terminal Bench 2.0 is a Harbor-native benchmark of 89 high-quality tasks that measure AI agents’ ability to operate autonomously in terminal environments. Tasks span software engineering, machine learning, security, data science, scientific computing, and system administration — requiring agents to install software, debug code, crack hashes, train models, configure servers, and much more.

“Terminal-bench: benchmarks for AI agents in terminal environments. Harbor-native benchmarks to quantify agents’ terminal mastery.” — tbench.ai

graph LR
    A["Traditional Code Benchmarks<br/>(HumanEval, SWE-bench)<br/>Code generation focus"] --> B["Limited to<br/>code editing"]
    B --> C["Terminal Bench 2.0<br/>89 real terminal tasks<br/>Full system mastery"]
    C --> D["Measures true<br/>agent autonomy"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is Terminal Bench 2.0?

Terminal Bench 2.0 is the second major version of the terminal-bench benchmark suite — a Stanford x Laude collaboration — designed to evaluate AI agents’ ability to solve complex, real-world tasks inside terminal environments. Unlike benchmarks that test isolated coding ability, Terminal Bench drops agents into full Linux environments and asks them to accomplish goals that a skilled software engineer or system administrator would handle.

Each task provides:

  • A detailed natural language description of the goal
  • A Docker container with the required environment pre-configured
  • Automated verification scripts that check whether the task was completed correctly

Key Characteristics

| Feature | Details |
| --- | --- |
| Total tasks | 89 |
| Categories | Software engineering, ML, security, data science, scientific computing, system administration, debugging, and more |
| Difficulty levels | Easy, Medium, Hard |
| Evaluation | Harbor-native (via the Harbor framework) |
| Metric | % Resolved — percentage of the 89 tasks fully completed |
| Anti-contamination | Canary string embedded in benchmark data |
| Versions | 1.0 (legacy), 2.0 (live), 3.0 (in development), Science 1.0 (in development) |

How Evaluation Works

Terminal Bench uses the Harbor framework for evaluation. Agents are given access to a terminal environment and must complete each task autonomously. The evaluation command is straightforward:

harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

Each task has automated verification that checks the agent’s work against precise success criteria — file contents, service availability, test outcomes, or computed results.
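
To make the Pass/Fail step concrete, here is a minimal sketch of what one of these verification scripts could look like, written in Python. It is purely illustrative and not taken from an actual Terminal Bench task: the results path, the JSON field, and the accuracy threshold are all assumptions made for the example.

# Hypothetical verifier for an illustrative task: the agent was asked to train a
# model and write its test accuracy to /app/results.json (path and schema assumed).
import json
import pathlib
import sys

RESULTS = pathlib.Path("/app/results.json")

def main() -> int:
    if not RESULTS.exists():
        print("FAIL: results file missing")
        return 1
    data = json.loads(RESULTS.read_text())
    accuracy = data.get("test_accuracy")
    if accuracy is None or accuracy < 0.90:
        print(f"FAIL: accuracy {accuracy} below the required threshold")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Real tasks typically layer several such checks — file contents, running services, test outcomes, computed results — and, because scoring is binary, a task only counts as resolved if the checks pass.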

graph TD
    A["Task Description<br/>(natural language)"] --> B["AI Agent"]
    C["Docker Container<br/>(pre-configured environment)"] --> B
    B --> D["Agent works in<br/>terminal autonomously"]
    D --> E["Automated Verification<br/>(success criteria checks)"]
    E --> F["Pass / Fail"]

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#2c3e50,color:#fff,stroke:#333

Who Built It?

Terminal Bench is a Stanford x Laude collaboration. The benchmark tasks were crafted by experts including:

  • Nicholas Carlini — Google DeepMind researcher, prolific task creator (security, software engineering, creative challenges)
  • Jan-Lucas Uslu — Task creator (security, hardware)
  • Junhong Shen — Task creator (system administration, data science)
  • Karl Krauth — Task creator (biology, scientific computing)
  • jeffreywpli, dwahdany — Task creators (data processing, ML)
  • And many other contributors from Stanford, Google, and the broader research community

| Resource | Link |
| --- | --- |
| Website | tbench.ai |
| Leaderboard | tbench.ai/leaderboard/terminal-bench/2.0 |
| Submission instructions | HuggingFace: harborframework/terminal-bench-2-leaderboard |
| Harbor Framework | harborframework.com |

What Skills Does It Test?

Terminal Bench 2.0 tests a remarkably diverse set of real-world terminal skills — far beyond what traditional coding benchmarks cover:

graph TD
    TB["Terminal Bench 2.0<br/>89 tasks"] --> SWE["Software Engineering<br/>Build, compile, debug"]
    TB --> ML["Machine Learning<br/>Train models, inference"]
    TB --> SEC["Security<br/>Crack hashes, find vulns"]
    TB --> DS["Data Science<br/>Process, query, analyze"]
    TB --> SCI["Scientific Computing<br/>Statistics, biology, physics"]
    TB --> SYS["System Administration<br/>Servers, VMs, configs"]

    style TB fill:#e74c3c,color:#fff,stroke:#333
    style SWE fill:#3498db,color:#fff,stroke:#333
    style ML fill:#27ae60,color:#fff,stroke:#333
    style SEC fill:#8e44ad,color:#fff,stroke:#333
    style DS fill:#f39c12,color:#fff,stroke:#333
    style SCI fill:#e67e22,color:#fff,stroke:#333
    style SYS fill:#6cc3d5,color:#fff,stroke:#333

| Category | Example Tasks | Difficulty |
| --- | --- | --- |
| Software engineering | Build POV-Ray from source, write a MIPS interpreter, implement pipeline parallelism in PyTorch | Easy–Hard |
| Machine learning | Train a FastText model on Yelp data, implement an LLM inference batching scheduler, recover a PyTorch model architecture | Medium–Hard |
| Security | Crack a 7z hash, exploit XSS filter bypasses, extract secrets from binaries, perform differential cryptanalysis | Medium–Hard |
| Data science | Reshard the C4 dataset, merge multi-source data, optimize SQL queries, set up HuggingFace model inference | Medium |
| Scientific computing | DNA assembly primer design, Raman spectrum fitting, MCMC sampling with Stan, adaptive rejection sampling | Medium–Hard |
| System administration | Configure a git webserver with auto-deploy, run Windows 3.11 in QEMU, set up mailing list servers, compile CompCert | Medium–Hard |
| Debugging | Fix an OCaml garbage collector, resolve C++ heap crashes, recover corrupted SQLite databases | Medium–Hard |

What Makes These Tasks Hard?

Unlike isolated coding challenges, Terminal Bench tasks require agents to:

  1. Navigate complex environments — install dependencies, configure build systems, manage services
  2. Chain multiple skills — a single task might require downloading, compiling, configuring, and verifying
  3. Handle real-world messiness — legacy code (COBOL modernization), obscure formats (G-code), corrupted data (WAL recovery)
  4. Demonstrate deep domain knowledge — from molecular biology (DNA assembly) to cryptography (FEAL attacks) to retro computing (Windows 3.11)

Current Leaderboard

The leaderboard below shows the top-performing agent–model combinations on Terminal Bench 2.0, ranked by % Resolved (percentage of 89 tasks completed successfully).

Source: Terminal Bench 2.0 Leaderboard (consulted July 2025). 120 total entries. Results verified by Terminal Bench team members.

| Rank | Agent | Model | Organization | % Resolved |
| --- | --- | --- | --- | --- |
| 1 | ForgeCode | Claude Opus 4.6 | ForgeCode / Anthropic | 81.8 ± 1.7 |
| 1 | ForgeCode | GPT-5.4 | ForgeCode / OpenAI | 81.8 ± 2.0 |
| 3 | TongAgents | Gemini 3.1 Pro | BIGAI / Google | 80.2 ± 2.6 |
| 4 | ForgeCode | Gemini 3.1 Pro | ForgeCode / Google | 78.4 ± 1.8 |
| 5 | SageAgent | GPT-5.3-Codex | OpenSage / OpenAI | 78.4 ± 2.2 |
| 6 | Droid | GPT-5.3-Codex | Factory / OpenAI | 77.3 ± 2.2 |
| 7 | Capy | Claude Opus 4.6 | Capy / Anthropic | 75.3 ± 2.4 |
| 8 | Simple Codex | GPT-5.3-Codex | OpenAI | 75.1 ± 2.4 |
| 9 | Terminus-KIRA | Gemini 3.1 Pro | KRAFTON AI / Google | 74.8 ± 2.6 |
| 10 | Terminus-KIRA | Claude Opus 4.6 | KRAFTON AI / Anthropic | 74.7 ± 2.6 |
| 11 | Mux | GPT-5.3-Codex | Coder / OpenAI | 74.6 ± 2.5 |
| 12 | MAYA-V2 | Claude Opus 4.6 | ADYA / Anthropic | 72.1 ± 2.2 |
| 13 | TongAgents | Claude Opus 4.6 | BIGAI / Anthropic | 71.9 ± 2.7 |
| 14 | Junie CLI | Multiple | JetBrains | 71.0 ± 2.9 |
| 15 | CodeBrain-1 | GPT-5.3-Codex | Feeling AI / OpenAI | 70.3 ± 2.6 |
| 16 | Droid | Claude Opus 4.6 | Factory / Anthropic | 69.9 ± 2.5 |
| 17 | Ante | Gemini 3 Pro | Antigma Labs / Google | 69.4 ± 2.1 |
| 18 | IndusAGI | GPT-5.3-Codex | SoloVpx / OpenAI | 69.1 ± 2.3 |
| 19 | Crux | Claude Opus 4.6 | Roam / Anthropic | 66.9 |
| 20 | Mux | Claude Opus 4.6 | Coder / Anthropic | 66.5 ± 2.5 |

Key takeaway: The top agents now solve over 80% of Terminal Bench 2.0 tasks, but the hardest ~20% — involving deep domain expertise in cryptography, biology, and complex systems — remain largely unsolved. The leaderboard features 120 entries from diverse organizations, with specialized agent frameworks (ForgeCode, Droid, TongAgents) consistently outperforming general-purpose CLI tools.

For the full, up-to-date leaderboard, visit the links in the next section.

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| Terminal Bench 2.0 Leaderboard | Full ranked leaderboard with all 120 entries, agents, and models | tbench.ai/leaderboard/terminal-bench/2.0 |
| Task Registry | Browse all 89 tasks with descriptions, categories, and difficulty | tbench.ai/benchmarks/terminal-bench-2 |
| Terminal Bench Home | Overview of all benchmark versions and upcoming releases | tbench.ai |

Submission and Evaluation

| Resource | Description | Link |
| --- | --- | --- |
| Submission Instructions | How to submit your agent to the leaderboard via HuggingFace | HF: harborframework/terminal-bench-2-leaderboard |
| Harbor Framework | The evaluation framework used to run Terminal Bench | harborframework.com |

Run the Benchmark

# Run Terminal Bench 2.0 with the Harbor CLI (install Harbor first)
harbor run -d terminal-bench@2.0 -a "your-agent" -m "your-model" -k 5

Understanding the Metrics

% Resolved

The primary metric. Each task is binary — either fully completed (verified by automated checks) or not. The score is the percentage of 89 tasks that the agent resolved successfully.

Confidence Intervals

Each leaderboard entry includes a confidence interval (± value) reflecting variance across evaluation runs. Smaller intervals indicate more consistent agent performance.
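
As a rough illustration of how these two numbers fit together, the snippet below computes a mean % Resolved and a standard-error-based interval from several evaluation runs of the same agent and model. It is only a sketch — the statistics behind the official leaderboard may differ, and the run scores here are invented.

# Illustrative only: aggregate per-run % Resolved scores into a mean and an
# approximate 95% interval. Not the official Harbor/leaderboard computation.
import statistics

def summarize_runs(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean % Resolved, ~95% half-width from the standard error)."""
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, 1.96 * sem

# Example: five runs of the same agent/model pair, each scored out of 89 tasks.
runs = [100 * solved / 89 for solved in (70, 72, 69, 73, 71)]
mean, half_width = summarize_runs(runs)
print(f"{mean:.1f} ± {half_width:.1f}")  # prints "79.8 ± 1.6"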

Agent vs. Model

Terminal Bench uniquely separates the agent framework (e.g., ForgeCode, Droid, Claude Code) from the underlying model (e.g., Claude Opus 4.6, GPT-5.4). This reveals that:

  • The same model performs very differently across agent frameworks
  • Specialized agent frameworks consistently outperform general-purpose tools
  • The agent scaffolding matters as much as the model capability

graph LR
    A["Model Capability<br/>(reasoning, knowledge)"] --> C["Task Resolution"]
    B["Agent Framework<br/>(tool use, planning)"] --> C
    C --> D["% Resolved<br/>on Terminal Bench"]

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333

Why Terminal Bench Matters

graph LR
    A["Code-only<br/>benchmarks"] --> B["Don't test<br/>system mastery"]
    B --> C["Terminal Bench<br/>fills the gap"]
    C --> D["Measures real<br/>agent autonomy"]

    A2["Isolated<br/>task evals"] --> B2["Miss multi-step<br/>complexity"]
    B2 --> C
    C --> D2["Drives agent<br/>framework innovation"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Tests real autonomy — Agents must navigate full environments, not just generate code snippets
  2. Broad skill coverage — From biology to cryptography to system administration, no single skill suffices
  3. Separates agent from model — Reveals that scaffolding and tool use matter as much as raw model capability
  4. Practical relevance — Tasks mirror what engineers actually do in terminals every day
  5. Anti-contamination — Canary strings and automated verification prevent benchmark gaming
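
The canary idea in point 5 is worth spelling out: a unique marker string is embedded in the benchmark data, so if it ever appears in a model’s training corpus or output, contamination is the likely explanation. The sketch below shows the mechanism with a placeholder value — the actual Terminal Bench canary string is not reproduced here.

# Conceptual sketch of a canary-string check over a text corpus. The canary
# value below is a placeholder, not the real Terminal Bench canary.
import pathlib

CANARY = "TERMINAL-BENCH-CANARY-PLACEHOLDER-0000"

def corpus_contains_canary(corpus_dir: str) -> bool:
    """Return True if any .txt file under corpus_dir contains the canary string."""
    for path in pathlib.Path(corpus_dir).rglob("*.txt"):
        if CANARY in path.read_text(errors="ignore"):
            print(f"Canary found in {path} -- possible benchmark contamination")
            return True
    return False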

Video: Terminal Bench 2.0 Explained

Conclusion

Terminal Bench 2.0 sets a new standard for evaluating AI agents in real-world terminal environments:

  • 89 expert-crafted tasks spanning software engineering, ML, security, data science, scientific computing, and system administration
  • Built as a Stanford x Laude collaboration with tasks from leading researchers including Nicholas Carlini
  • Evaluated via the Harbor framework — reproducible, containerized, and open for submissions
  • The best agents solve ~82% of tasks, but the hardest challenges in cryptography, biology, and complex systems remain unsolved
  • The benchmark uniquely separates agent framework from model, revealing that scaffolding matters as much as raw capability

With Terminal Bench 3.0 and Terminal Bench Science already in development, this benchmark family is rapidly evolving to keep pace with agent capabilities — ensuring we have a rigorous measure of what AI agents can truly accomplish when given a terminal and a goal.
